Workflow-based systematic design of high throughput genome annotation
Author
Abstract
The genus Eimeria belongs to the phylum Apicomplexa, which includes many obligate intracellular protozoan parasites of humans and livestock. E. tenella is one of seven species that infect the domestic chicken and cause the intestinal disease coccidiosis, which is economically important for the poultry industry. E. tenella is highly pathogenic and is often used as a model species for studies of Eimeria biology. In this PhD thesis, a comprehensive annotation system named “WAGA” (Workflow-based Automatically Genome Annotation) was built and applied to the E. tenella genome. InforSense KDE and its BioSense plug-in (products of the InforSense company) were the core software used to build the workflows. Workflows were made by integrating individual bioinformatics tools into a single platform, and each workflow was designed to provide a standalone service for a particular task. Three major workflows were developed based on the genomic resources currently available for E. tenella: EST-based gene construction, HMM-based gene prediction and protein-based annotation. Finally, a combining workflow was built on top of the individual ones to generate a set of automatic annotations using all of the available information. The overall system and its three major components were deployed as web servers that are fully tuneable and reusable by end users. WAGA does not require users to have programming skills or knowledge of the underlying algorithms or mechanisms of its low-level components. E. tenella was the target genome here, and all the results obtained were displayed in GBrowse. A sample of the results was selected for experimental validation. For evaluation purposes, WAGA was also applied to another apicomplexan parasite, Plasmodium falciparum, the causative agent of human malaria, whose genome has been extensively annotated. The results obtained were compared with the gene predictions of PHAT, a gene finder designed for and used in the P. falciparum genome project.
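The abstract does not say how the combining workflow reconciles the three evidence sources, and WAGA itself was built in InforSense KDE rather than in code. Purely as an illustration of the general idea, the following minimal Python sketch groups overlapping gene models from the EST-based, HMM-based and protein-based tracks into loci and keeps those supported by at least two sources. The `GeneModel` type, the `combine` function, the `min_sources` threshold and the sample coordinates are all hypothetical and are not taken from the thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneModel:
    """One predicted gene span from a single evidence source (hypothetical type)."""
    seqid: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"
    source: str  # e.g. "est", "hmm" or "protein"


def combine(models, min_sources=2):
    """Merge overlapping same-strand models into loci.

    Assumed rule (not from the thesis): a locus is reported only if at
    least `min_sources` distinct evidence sources overlap it.
    """
    loci = []
    # Sorting by sequence, strand and start lets one sweep merge overlaps.
    for m in sorted(models, key=lambda g: (g.seqid, g.strand, g.start)):
        last = loci[-1] if loci else None
        if (last and last["seqid"] == m.seqid
                and last["strand"] == m.strand
                and m.start <= last["end"]):
            # Overlaps the current locus: extend it and record the source.
            last["end"] = max(last["end"], m.end)
            last["sources"].add(m.source)
        else:
            loci.append({"seqid": m.seqid, "strand": m.strand,
                         "start": m.start, "end": m.end,
                         "sources": {m.source}})
    return [l for l in loci if len(l["sources"]) >= min_sources]


# Made-up example coordinates, for illustration only.
example = [
    GeneModel("contig1", 100, 500, "+", "est"),
    GeneModel("contig1", 450, 900, "+", "hmm"),
    GeneModel("contig1", 2000, 2500, "+", "protein"),
    GeneModel("contig1", 100, 300, "-", "hmm"),
]
supported = combine(example)
```

In this toy input, only the plus-strand region where the EST and HMM models overlap is kept; the two single-source models are dropped. A real combiner would work at exon level and weight the sources, which this sketch does not attempt.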
Similar resources
Bioinformatics for plant genome annotation
High throughput sequencing must be matched by high throughput annotation. Given the large number of annotation tools available, a multitude of interdependent analyses are required for an in-depth annotation of even a single BAC sequence. Special annotation pipeline software is required to make such annotation processes feasible in an automated fashion. In terms of functionality, such software s...
BambooGDB: a bamboo genome database with functional annotation and an analysis platform
Bamboo, as one of the most important non-timber forest products and fastest-growing plants in the world, represents the only major lineage of grasses that is native to forests. Recent success on the first high-quality draft genome sequence of moso bamboo (Phyllostachys edulis) provides new insights on bamboo genetics and evolution. To further extend our understanding on bamboo genome and facili...
A Clustering Approach to Scientific Workflow Scheduling on the Cloud with Deadline and Cost Constraints
One of the main features of High Throughput Computing systems is the availability of high power processing resources. Cloud Computing systems can offer these features through concepts like Pay-Per-Use and Quality of Service (QoS) over the Internet. Many applications in Cloud computing are represented by workflows. Quality of Service is one of the most important challenges in the context of sche...
Community annotation and bioinformatics workforce development in concert—Little Skate Genome Annotation Workshops and Jamborees
Recent advances in high-throughput DNA sequencing technologies have equipped biologists with a powerful new set of tools for advancing research goals. The resulting flood of sequence data has made it critically important to train the next generation of scientists to handle the inherent bioinformatic challenges. The North East Bioinformatics Collaborative (NEBC) is undertaking the genome sequenc...
SNPbox: web-based high-throughput primer design from gene to genome
SNPbox is a modular software package that automates the design of PCR primers for large-scale amplification and sequencing projects in a standardized manner resulting in high-quality PCR amplicons with a low failure rate. Here, we present the SNPbox web server at http://www.SNPbox.org, which hosts the SNPbox web service as well as the data from SNPbox analysis of all Ensembl exons. The data of ...